Here we examine whether publishing volume affects overall, article, or place traffic: specifically, whether total, average, or median traffic changes as we publish more.

TL;DR: the results are somewhat inconclusive, but we're probably better off publishing about 6 places per day. For articles, we seem to perform better the more we publish, but only up to a point. That point seems to be around 12 or 13 articles per day.


In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
df = pd.read_csv('All content.csv', index_col='Published', parse_dates=True)
df['count'] = 1
df = df[df['Page Views'] > 200]  # drop low-traffic noise

In [7]:
df_resampled = df.resample('D').sum()

Let's truncate the data to the period between June 1, 2015 and March 4, 2016. This keeps very old content out of the window while also excluding anything less than six days old.
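The daily-sum-then-truncate pattern used in the next two cells can be sketched on a few synthetic rows (the dates and numbers here are made up for illustration):

```python
import pandas as pd

# Synthetic example of the daily-sum + truncate pattern
idx = pd.to_datetime(['2015-05-30', '2015-06-01', '2015-06-01', '2015-06-02'])
toy = pd.DataFrame({'Page Views': [100, 300, 500, 250], 'count': 1}, index=idx)

daily = toy.resample('D').sum()                             # one row per calendar day
window = daily.truncate(before='2015-06-01', after='2015-06-02')  # clip the date range
print(window['Page Views'].tolist())  # [800, 250]
print(window['count'].tolist())       # [2, 1]
```

The 'count' column of ones becomes a per-day publish count once summed, which is what the scatter plots below rely on.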


In [8]:
df_trunc = df_resampled.truncate(before='2015-06-01', after='2016-03-04')

In [9]:
df_trunc = df_trunc.dropna()

In [10]:
df_trunc = df_trunc[['Page Views', 'Social Actions', 'Social Referrals', 'Facebook Shares', 'count']]
df_trunc['mean'] = df_trunc['Page Views'] // df_trunc['count']  # floor division: whole-number PVs per piece

Here we plot total page views ("PVs total") and average page views against how many pieces of content were published in a given day.

From this we see increasing returns to publishing more, but the data are sparse at the high end of the range.
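One way to put a number on that visual impression is a simple correlation coefficient; a minimal sketch with made-up daily totals (column names mirror the real data):

```python
import pandas as pd

# Hypothetical daily totals: does traffic scale with publishing volume?
daily = pd.DataFrame({
    'count':      [4, 6, 8, 10, 12],
    'Page Views': [5000, 9000, 12000, 17000, 20000],
})

# Pearson correlation between publish count and total page views
r = daily['count'].corr(daily['Page Views'])
print(round(r, 3))
```

On the real data this would give a single summary number to set beside the scatter plots, keeping in mind that correlation says nothing about the sparse high-count days.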


In [11]:
df_trunc.plot(kind='scatter',x='count',y='Page Views',title='PVs total')
df_trunc.plot(kind='scatter',x='count',y='mean',title='Average PVs')
df_trunc.plot(kind='scatter',x='count',y='Facebook Shares',title='Total Facebook Shares')


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1197ce350>

In [12]:
df2 = pd.read_csv('All content.csv',index_col='Published',parse_dates=True)

df_articles = df2[df2['Url'].str.contains('/articles/', na=False)].copy()
df_places = df2[df2['Url'].str.contains('/places/', na=False)].copy()
df_articles['count'] = 1
df_places['count'] = 1


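Filtering a DataFrame and then assigning a new column on the result can trigger pandas' SettingWithCopyWarning; taking an explicit .copy() makes the slice an independent frame and silences it. A minimal sketch:

```python
import pandas as pd

df2 = pd.DataFrame({'Url': ['/articles/a', '/places/b', None],
                    'Page Views': [100, 200, 300]})

# .copy() makes the filtered slice an independent frame, so the
# column assignment below no longer targets a possible view of df2
df_articles = df2[df2['Url'].str.contains('/articles/', na=False)].copy()
df_articles['count'] = 1
print(df_articles['count'].tolist())  # [1]
```

na=False is needed because rows with a missing Url would otherwise propagate NaN through the boolean mask.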

In [13]:
df_articles_resampled = df_articles.resample('D').sum()
df_articles_trunc = df_articles_resampled.truncate(before='2015-06-01', after='2016-03-04')
df_articles_trunc = df_articles_trunc.dropna()
df_articles_trunc = df_articles_trunc[['Page Views', 'Social Actions', 'Social Referrals', 'Facebook Shares', 'count']]
df_articles_trunc['mean']=df_articles_trunc['Page Views']//df_articles_trunc['count']

Articles

Here we plot the average and total pageviews for articles against the number of articles published per day. Performance improves for a while, but then drops off once article publishing exceeds 13 per day.

Even so, the correlation is not very strong.

In [14]:
df_articles_trunc.plot(kind='scatter',x='count',y='Page Views',title='Articles PVs total')
df_articles_trunc.plot(kind='scatter',x='count',y='mean',title='Articles Average PVs')
df_articles_trunc.plot(kind='scatter',x='count',y='Facebook Shares',title='Total Facebook Shares')


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x119766c90>

In [15]:
df_places_resampled = df_places.resample('D').sum()
df_places_trunc = df_places_resampled.truncate(before='2015-06-01', after='2016-03-04')
df_places_trunc = df_places_trunc.dropna()
df_places_trunc = df_places_trunc[['Page Views', 'Social Actions', 'Social Referrals', 'Facebook Shares', 'count']]
df_places_trunc['mean']=df_places_trunc['Page Views']//df_places_trunc['count']

Places

It appears that there is only a very weak correlation between the number of places published per day and either total or average place performance.


In [16]:
df_places_trunc.plot(kind='scatter',x='count',y='Page Views',title='Places PVs total')
df_places_trunc.plot(kind='scatter',x='count',y='mean',title='Places Average PVs')
df_places_trunc.plot(kind='scatter',x='count',y='Facebook Shares',title='Total Facebook Shares')


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1199a4fd0>

Median Article Traffic


In [17]:
df_articles_resampled2 = df_articles.resample('D').median()
df_articles_trunc2 = df_articles_resampled2.truncate(before='2015-06-01', after='2016-03-04')
df_articles_trunc2 = df_articles_trunc2.dropna()
df_articles_trunc2 = df_articles_trunc2[['Page Views', 'Social Actions', 'Social Referrals', 'Facebook Shares']]
df_articles_trunc2['count']=df_articles_trunc['count']

In [18]:
df_articles_trunc2.plot(kind='scatter',x='count',y='Page Views',title='Median Articles PVs')


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x11549ae90>

Again, there is nearly zero correlation between median article traffic per day and publishing volume.

Let's look at Places, just to be sure.


In [19]:
df_places_resampled2 = df_places.resample('D').median()
df_places_trunc2 = df_places_resampled2.truncate(before='2015-06-01', after='2016-03-04')
df_places_trunc2 = df_places_trunc2.dropna()
df_places_trunc2 = df_places_trunc2[['Page Views', 'Social Actions', 'Social Referrals', 'Facebook Shares']]
df_places_trunc2['count']=df_places_trunc['count']
df_places_trunc2.plot(kind='scatter',x='count',y='Page Views',title='Median Places PVs')


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d3d5d90>

It looks like there is actually a weak relationship between median place traffic and overall publishing volume, with an apparent optimum around 6 places per day.
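One way to make that "around 6 per day" read-off concrete is to group days by publish count and compare medians; a sketch with made-up numbers (not the real data):

```python
import pandas as pd

# Hypothetical (publish count, median daily PVs) observations for place content
days = pd.DataFrame({'count':      [2, 2, 6, 6, 6, 10, 10],
                     'Page Views': [400, 600, 900, 1100, 1000, 700, 500]})

# Median traffic at each publish volume, then the volume that maximizes it
by_count = days.groupby('count')['Page Views'].median()
print(by_count.idxmax())  # publish volume with the highest median traffic
```

With the real df_places data this would turn the visual judgment from the scatter plot into a single number, though with so few high-volume days the answer should be treated as indicative, not definitive.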



In [22]:
import statsmodels.formula.api as smf
df_trunc


Out[22]:
Page Views Social Actions Social Referrals Facebook Shares count mean
Published
2015-06-01 16796 602 2700 91 7 2399
2015-06-02 15121 1596 3360 291 8 1890
2015-06-03 31562 447 1316 70 7 4508
2015-06-04 8289 630 2170 117 7 1184
2015-06-05 3070 62 181 11 7 438
2015-06-08 12709 1117 1886 193 11 1155
2015-06-09 8516 755 1206 100 10 851
2015-06-10 9727 232 236 29 8 1215
2015-06-11 16643 1774 5565 315 9 1849
2015-06-12 11254 1928 4839 409 8 1406
2015-06-15 17122 3473 7574 549 6 2853
2015-06-16 3815 174 70 19 7 545
2015-06-17 6870 1045 2058 217 9 763
2015-06-18 139729 19402 82383 3302 9 15525
2015-06-19 11084 2900 4758 434 7 1583
2015-06-22 4926 858 665 146 8 615
2015-06-23 7946 1101 1533 91 6 1324
2015-06-24 19839 2807 5596 435 7 2834
2015-06-25 12148 989 5419 128 9 1349
2015-06-26 95266 7429 56449 1306 8 11908
2015-06-29 6841 263 917 53 4 1710
2015-06-30 77753 6421 54201 1039 9 8639
2015-07-01 5562 1096 866 233 8 695
2015-07-02 8516 1233 2027 253 9 946
2015-07-03 7203 788 1292 151 4 1800
2015-07-06 13168 2210 3257 383 8 1646
2015-07-07 85092 14057 56996 2308 9 9454
2015-07-08 11425 1444 4428 291 9 1269
2015-07-09 9918 2249 2413 413 7 1416
2015-07-10 10207 1717 2693 289 7 1458
... ... ... ... ... ... ...
2016-02-04 76705 6446 27421 882 17 4512
2016-02-05 116980 7799 50287 1477 17 6881
2016-02-06 10747 807 3768 142 2 5373
2016-02-07 4885 530 1349 124 2 2442
2016-02-08 77578 19993 39979 3232 11 7052
2016-02-09 93031 18849 36325 2962 17 5472
2016-02-10 132885 26072 64903 4289 17 7816
2016-02-11 111512 15704 54962 2691 17 6559
2016-02-12 133127 16497 60692 2908 17 7831
2016-02-13 6085 429 2433 58 2 3042
2016-02-14 41886 2836 17159 513 3 13962
2016-02-15 85001 16350 38497 2379 16 5312
2016-02-16 223105 24597 147230 4245 18 12394
2016-02-17 116709 11023 33455 1998 21 5557
2016-02-18 107151 19095 40291 2906 20 5357
2016-02-19 279126 53864 152526 7301 21 13291
2016-02-20 28192 4099 12358 647 2 14096
2016-02-21 9923 793 3761 144 2 4961
2016-02-22 348047 25172 202742 4153 17 20473
2016-02-23 67093 14798 30778 1938 17 3946
2016-02-24 59589 15410 31228 2346 17 3505
2016-02-25 182339 48611 110455 7624 20 9116
2016-02-26 73464 7573 22560 1305 18 4081
2016-02-27 5680 498 1900 68 2 2840
2016-02-28 8983 938 3487 143 2 4491
2016-02-29 76747 20246 39611 2619 19 4039
2016-03-01 131868 14143 51002 2448 20 6593
2016-03-02 69039 12842 31688 2143 17 4061
2016-03-03 133869 10515 66790 2030 19 7045
2016-03-04 131507 34244 57390 5279 20 6575

218 rows × 6 columns


In [30]:
lm = smf.ols(formula="Q('Page Views') ~ count", data=df_trunc).fit()
lm.summary()

The original run raised PatsyError: "Number of rows mismatch between data argument and ['Page Views'] (218 versus 1)" because the formula was written as ['Page Views'] ~ count, and patsy evaluated ['Page Views'] as a one-element Python list rather than a column reference. A column name containing a space has to be wrapped in patsy's Q().
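The PatsyError above comes down to the space in the "Page Views" column name; patsy's Q() quoting handles it. A runnable sketch on a toy frame (the numbers are invented):

```python
import pandas as pd
import statsmodels.formula.api as smf

toy = pd.DataFrame({'Page Views': [100, 210, 290, 405, 500],
                    'count':      [1, 2, 3, 4, 5]})

# Q() lets patsy reference a column whose name contains a space
lm = smf.ols(formula="Q('Page Views') ~ count", data=toy).fit()
print(round(lm.params['count'], 1))  # fitted slope of PVs on publish count
```

Renaming the column up front (e.g. df.rename(columns={'Page Views': 'page_views'})) would work just as well and avoids the quoting entirely.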

In [ ]: